In this session, we will learn basics of data visualization with R using the ggplot2 package, which is part of the tidyverse ecosystem that you used last week. We will also take a look at the esquisse package for click-and-go creation of the ggplot2 graphics. Lastly, we will briefly touch base on the interactive graphics with plotly package.
R has built-in functions such as plot(), barplot(), hist(), etc. to visualize data. Nevertheless, creating visuals with base R functions is not as efficient as using ggplot2 functions. ggplot2 has a well established grammar of graphics and intuitive work flow with the + operator, just like the pipe (%>%) operator that you learned last week.
We will use one of the data sets that we used last week, easyCBM, which contains oral reading fluency (ORF) scores and some demographic variables of young readers. Let’s first load the tidyverse, sjmisc, and sjlabelled packages and read the easyCBM SPSS data file. Please note that we will be using sjmisc and sjlabelled packages for exploring data.
library(tidyverse)
library(sjmisc)
library(sjlabelled)
easyCBM <- read_spss("easyCBM.sav")
Okay! Let’s take a quick look at the data set and remember what information does it contain.
easyCBM %>% slice(1:5)
What about the descriptive statistics for the continuous (e.g., ORF scores) variables?
easyCBM %>%
select(ORF1:Test_A) %>%
descr()
## Basic descriptive statistics
var type label n NA.prc mean sd se md trimmed range iqr skew
ORF1 numeric Oral fluency score Fall 250 0 143.42 40.44 2.56 145.0 143.66 221 (31-252) 56.00 -0.02
ORF2 numeric Oral fluency score Winter 250 0 150.09 40.29 2.55 148.0 149.62 229 (38-267) 56.50 0.07
ORF3 numeric Oral fluency score Spring 250 0 164.98 37.88 2.40 168.0 166.23 197 (56-253) 50.75 -0.31
MCC1 numeric Reading test score Fall 250 0 13.92 3.40 0.22 15.0 14.22 16 (4-20) 5.00 -0.77
MCC2 numeric Reading test score Winter 250 0 14.41 2.94 0.19 15.0 14.78 16 (4-20) 3.00 -1.32
MCC3 numeric Reading test score Spring 250 0 15.89 3.08 0.19 16.5 16.26 15 (5-20) 3.00 -1.08
VOC1 numeric Vocabulary score Fall 250 0 18.31 4.79 0.30 19.5 18.84 21 (4-25) 7.00 -0.85
VOC3 numeric Vocabulary score Spring 250 0 20.23 4.11 0.26 22.0 20.76 18 (7-25) 5.00 -1.18
Test_A numeric Statewide Test Score 250 0 224.38 9.91 0.63 224.0 224.62 55 (198-253) 12.00 -0.17
And frequency tables for the ethnicity, gender, and proficiency as categorical variables (We will use those three for graphics).
easyCBM %>%
select(Ethnic, Gender, Prof) %>%
frq()
Ethnicity code (Ethnic) <numeric>
# total N=250 valid N=250 mean=4.52 sd=1.45
Value | Label | N | Raw % | Valid % | Cum. %
----------------------------------------------------------
1 | Amer Ind/Alsk Nat | 10 | 4 | 4 | 4
2 | Asian/Pac Isl | 20 | 8 | 8 | 12
3 | Black | 20 | 8 | 8 | 20
4 | Hispanic | 50 | 20 | 20 | 40
5 | White | 100 | 40 | 40 | 80
6 | Multi-Ethnic | 30 | 12 | 12 | 92
7 | Decline | 20 | 8 | 8 | 100
<NA> | <NA> | 0 | 0 | <NA> | <NA>
Gender (Gender) <numeric>
# total N=250 valid N=250 mean=0.50 sd=0.50
Value | Label | N | Raw % | Valid % | Cum. %
------------------------------------------------
0 | Males | 126 | 50.40 | 50.40 | 50.40
1 | Females | 124 | 49.60 | 49.60 | 100.00
<NA> | <NA> | 0 | 0.00 | <NA> | <NA>
Reading Proficiency status (Prof) <numeric>
# total N=250 valid N=250 mean=0.81 sd=0.39
Value | Label | N | Raw % | Valid % | Cum. %
------------------------------------------------------
0 | Not Meet | 47 | 18.80 | 18.80 | 18.80
1 | Meets/Exceeds | 203 | 81.20 | 81.20 | 100.00
<NA> | <NA> | 0 | 0.00 | <NA> | <NA>
Now, we have refreshed our memory about the easyCBM data. I would like to recode the numerical values of the categorical variables back to nominal values with their labels (and trim the ethnicity variable). Also, I will convert them to factors, which is the class of objects that have levels. Factor is a useful object type in R (an R way of saying “categorical variable”) and is used in statistical analyses, as well as in graphics. Using the argument as.num=F within the rec() function will do this conversion automatically.
easyCBM <-
easyCBM %>%
mutate(Gender=rec(Gender, rec = "0=Male; 1=Female; else=NA", as.num = F),
Ethnic=rec(Ethnic, rec = "2=Asian; 3=Black; 4=Hispanic; 5=White; else=Other", as.num = F),
Prof=rec(Prof, rec = "0=Fail; 1=Pass; else=NA", as.num = F))
ggplot2()and Use of AestheticsLet’s get started with a basic bar graph that shows the frequency of gender. The main function to create graphics with ggplot2 is ggplot(). Let’s see what would happen if we apply this function to the easyCBM data set.
easyCBM %>%
ggplot()
So, by applying the base ggplot() function, we obtained an empty graph (i.e., a coordinate system ) for the graphic. Please note that we haven’t indicated anything about gender yet. The next step is to describe the aesthetics of the graph by using the aes() argument for mapping. In aesthetics, we define the axes, shapes, colors etc. As stated by Wickham & Grolemund (2017) mapping defines how variables are mapped to visual properties of the graph, in other words the aesthetics. Thus, in the code below, we tell ggplot() that the values of x-axis will be identified by gender variable of the easyCBM data. Please note that I will not be using the mapping= argument in the following code chunks to make the codes look simpler.
easyCBM %>%
ggplot(mapping=aes(x=Gender))
Okay! Now, it is time to declare the type of the graph we want: A bar chart! In ggplot2 language, after we create the coordinate system, we start adding layers to our graphs. Layers starts with the type of the graph (e.g., bar, histogram, line, etc.), which is referred to as geometry. We will use the geom_bar() function for the bar chart. Similarly, other type of chart geometries uses the geom_ prefix such as geom_histogram() or geom_line(). Very intuitive, isn’t it?
easyCBM %>%
ggplot(aes(x=Gender)) +
geom_bar()
We can check quickly what other types of geometries are available in ggplot2 by using the handy apropos() function. Please note that some of the geom functions below are used to enhance the main graphic rather than as a standalone type of geometry. For example, geom_hline() is used to add a horizontal line to the main graphic.
apropos("geom_")
[1] "geom_abline" "geom_area" "geom_bar" "geom_bin2d"
[5] "geom_blank" "geom_boxplot" "geom_col" "geom_contour"
[9] "geom_contour_filled" "geom_count" "geom_crossbar" "geom_curve"
[13] "geom_density" "geom_density_2d" "geom_density_2d_filled" "geom_density2d"
[17] "geom_density2d_filled" "geom_dotplot" "geom_errorbar" "geom_errorbarh"
[21] "geom_freqpoly" "geom_function" "geom_hex" "geom_histogram"
[25] "geom_hline" "geom_jitter" "geom_label" "geom_line"
[29] "geom_linerange" "geom_map" "geom_path" "geom_point"
[33] "geom_pointrange" "geom_polygon" "geom_qq" "geom_qq_line"
[37] "geom_quantile" "geom_raster" "geom_rect" "geom_ribbon"
[41] "geom_rug" "geom_segment" "geom_sf" "geom_sf_label"
[45] "geom_sf_text" "geom_smooth" "geom_spoke" "geom_step"
[49] "geom_text" "geom_tile" "geom_violin" "geom_vline"
[53] "match_geom_args" "update_geom_defaults"
Great! We now have a beautiful (kind of) bar chart. Now let’s make it look better. We will improve the look of the graph by adding layers with the + operator.
easyCBM %>%
ggplot(aes(x=Gender)) +
geom_bar(fill="blue", width=.5) +
ggtitle("Frequency of Gender") +
labs(y="Frequency")
Now, let’s take look at the new code chunk below and I’ll copy the previous code that we used for 1-1 comparison. It seems like both codes produced the exact same graph, however, they have some differences in terms of using the aes() argument. In the new code, aes() was used within the geom_bar() function. Although it didn’t make a difference with this example, it has an important implication with more complex graphs such as lines and points combined with different geom_ functions. We will see an example for that later but for now, I recommend adopting the new approach, where we map aesthetics of the graph within relevant geom_ function.
#New Code
easyCBM %>%
ggplot() +
geom_bar(aes(x=Gender), fill="blue", width=.5) +
ggtitle("Frequency of Gender") +
labs(y="Frequency")
#Old Code
#easyCBM %>%
# ggplot(aes(x=Gender)) +
# geom_bar(fill="blue", width=.5) +
# ggtitle("Frequency of Gender") +
# labs(y="Frequency")
Now, let’s create a scatter plot to see the the relationship between the ORF scores in Fall (ORF1) and Winter (ORF2).Did you realize the new aesthetics that we defined? Since it is a scatter plot, we need to define the y-axis as well. And of course, we used geom_point() for the scatter plot.
easyCBM %>%
ggplot() +
geom_point(aes(x=ORF1, y=ORF2))
Now, say we want to change the color of the points to blue. Let’s try this.
easyCBM %>%
ggplot() +
geom_point(aes(x=ORF1, y=ORF2, color="blue"))
Oops! It seems like R crashed or it became color blind. Nope. It was our fault. Remember, we learned that aes() maps the visual characteristics to our variables within the data set. Here, however, we just wanted to change the color of all points to blue. Thus, we are not associating the color to any attribute of data variable. The solution is: anything that needs to be general to the plot should be indicated out of the aes() argument. Let’s solve this problem!
easyCBM %>%
ggplot() +
geom_point(aes(x=ORF1, y=ORF2), color="blue")
Now, assume that we intentionally want to associate the color with some data attributes. In other words, let’s assume we want to display the relationship of the ORF1 and ORF2 scores in different colors for male and female students. I think you know the answer for how to do that!
easyCBM %>%
ggplot() +
geom_point(aes(x=ORF1, y=ORF2, color=Gender))
Did you realize that a legend was automatically created? What else can we map to gender besides color? Well, shape of the points, size, etc.
easyCBM %>%
ggplot() +
geom_point(aes(x=ORF1, y=ORF2, color=Gender, shape=Gender, size=Gender))
Okay, a last step: how to assign a different set of colors? It seems like ggplot2 selected its favorite colors. Let’s use SMU’s official colors in hexadecimal color codes. I will also show some extra layers for changing characteristics of the graph and adding a smoothed line to imply the relationship between the two ORF scores.
easyCBM %>%
ggplot() +
geom_point(aes(x=ORF1, y=ORF2, color=Gender)) +
scale_color_manual(values = c("#CC0035", "#354CA1")) +
scale_y_continuous(lim=c(0, 300), breaks = as.numeric(seq(0, 300, 100))) +
scale_x_continuous(lim=c(0, 300), breaks = as.numeric(seq(0, 300, 100))) +
geom_smooth(aes(x=ORF1, y=ORF2), color="black", linetype="dashed") +
theme_bw()
Now, it is time to remember the note on the use of aes() either within each geom_ function or within the main ggplot() function. As you realize in the code above, we have some common arguments for the aesthetics of the geom_point() and geom_smooth(). We can use those global aesthetics within the ggplot() function. This helps us save some time by eliminating the repetition of those aesthetic per geom_. Let’s see if this would work.
easyCBM %>%
ggplot(aes(x=ORF1, y=ORF2)) +
geom_point(aes(color=Gender)) +
scale_color_manual(values = c("#CC0035", "#354CA1")) +
scale_y_continuous(lim=c(0, 300), breaks = as.numeric(seq(0, 300, 100))) +
scale_x_continuous(lim=c(0, 300), breaks = as.numeric(seq(0, 300, 100))) +
geom_smooth(color="black", linetype="dashed") +
theme_bw()
Up to this point, we visualized our data in a single panel/window. In other words, we used different visual characteristics such as color to present the information for various groups. We might want to create separate panels for this aim. Now, let’s create the previous scatter plot in two panels for males and females. We will use the facet_wrap() function to accomplish this. I’ll remove the all added extra layers to the previous graph to highlight the use of new function and prevent the complex look of the code chunk.
easyCBM %>%
ggplot() +
geom_point(aes(x=ORF1, y=ORF2)) +
facet_wrap(~Gender)
We can also change the orientation of the panels.
easyCBM %>%
ggplot() +
geom_point(aes(x=ORF1, y=ORF2)) +
facet_wrap(~Gender, nr=2)
Let’s make the visual more complex by mapping the color of the points to the proficiency levels as Fail/Pass. Since we saved an attribute (Gender) by using multi-panels, we can use color mapping for the proficiency.
easyCBM %>%
ggplot() +
geom_point(aes(x=ORF1, y=ORF2, color=Prof)) +
facet_wrap(~Gender, nr=2)
As its name implies, facet_wrap() wraps the same specifications of the core visual along with the levels of the desired categorical (actually it needs to be a factor!) variable, which is gender in this example. Now, let’s assume that we want to display the relationship of the two ORF scores for the combination of the gender and ethnicity. For that, we will use facet_grid(). You might ask “Can we do this by facet_wrap()?”. Well, yes, let’s take a look at this option first.
easyCBM %>%
ggplot() +
geom_point(aes(x=ORF1, y=ORF2, color=Prof)) +
facet_wrap(~Gender+Ethnic)
It does not look good (in my opinion). Let’s adjust the panel orientation.
easyCBM %>%
ggplot() +
geom_point(aes(x=ORF1, y=ORF2, color=Prof)) +
facet_wrap(~Gender+Ethnic, nc=5)
Now, let’s use facet_grid().
easyCBM %>%
ggplot() +
geom_point(aes(x=ORF1, y=ORF2, color=Prof)) +
facet_grid(Gender~Ethnic)
So which one looks better? Were you able to spot the differences? And finally, let’s make the plot looks better by applying the functions that we’ve learned so far. Please note that in the code below, we assigned a name (gg_last) to our graph and we can display it just by typing gglast.
gg_last <-
easyCBM %>%
ggplot() +
geom_point(aes(x=ORF1, y=ORF2, color=Prof)) +
scale_color_manual(values = c("#CC0035", "#354CA1")) +
scale_y_continuous(lim=c(0, 300), breaks = as.numeric(seq(0, 300, 100))) +
scale_x_continuous(lim=c(0, 300), breaks = as.numeric(seq(0, 300, 150))) +
facet_grid(Gender~Ethnic) +
theme_bw() +
ggtitle("Scatter Plot of ORF Scores") +
theme(plot.title = element_text(hjust = 0.5))
gg_last
Esquisse PackageIf you are more a Tableau person, you might want to try esquisse package, which makes it possible to create ggplot2() graphics interactively. It can also be used as a good hands-on activity to strengthen your knowledge of ggplot2 grammar. Nevertheless, it can not produce very complicated graphs as the ones created by manual coding. Now, we’ll upload the esquisse package and do some graphing interactively within R Studio. (Please make sure to install the esquisse package).
library(esquisse)
#esquisser()
So far, we have learned how to create beautiful graphs by using ggplot2 package. We can take this one step further by producing interactive graphs, still using R and R studio. The idea of interactive graphs also perfectly fits to the capability of R studio in creating html-based documents. Thus, we can take the advantage of this format by making our visuals interactive and attractive.
We will use the plotly package for the creation of interactive plots. Please make sure to install plotly and load it.
library(plotly)
One quick and easy way to create the interactive version of a previously created ggplot2 graph is using the ggplotly() function of the plotly package. Let’s make the latest graph we created interactive.
gg_last %>%
ggplotly()
It is possible to customize the specs of the interactive graph through plotly package functions. Here are some examples.
gg_last %>%
ggplotly(width = 750, height=600) %>%
layout(legend = list(xanchor="left", x=1.1, y=0.95, title="top left"))
We can also use plot_ly function of the plotly package to create interactive graphs from scratch. Let’s draw a simple box plot of the ORF1 scores grouped for each ethnicity.
easyCBM %>%
plot_ly(x = ~ORF1, color = ~Ethnic, type = "box")
Now, let’s draw a violin plot to present the similar information as with the previous boxplot.
easyCBM %>%
plot_ly(y=~ORF1,
x=~Ethnic,
color = ~Ethnic,
box = list(visible = T),
type="violin")
More examples of plot_ly() can be found at https://plotly.com/r/.